GH-32609: [Python] Add type annotations to PyArrow #47609

rok wants to merge 1 commit into apache:main
Conversation
dangotbanned
left a comment
Hey @rok, I come bearing unsolicited suggestions 😉
A lot of this is from 2 recent PRs that have had me battling the current stubs more
python/pyarrow-stubs/compute.pyi (outdated):

def field(*name_or_index: str | tuple[str, ...] | int) -> Expression: ...
def scalar(value: bool | float | str) -> Expression: ...
Based on arrow/python/pyarrow/_compute.pyx (lines 2859 to 2869 in 13c2615):
The Expression version (pc.scalar) should accept the same types as pa.scalar, right?
I ran into it the other day here, where I needed to add a cast.
I'm not sure what you are suggesting. Do you mean:
diff --git i/python/pyarrow-stubs/compute.pyi w/python/pyarrow-stubs/compute.pyi
index df660e0c0c..f005c5f552 100644
--- i/python/pyarrow-stubs/compute.pyi
+++ w/python/pyarrow-stubs/compute.pyi
@@ -84,7 +84,7 @@ _R = TypeVar("_R")
def field(*name_or_index: str | tuple[str, ...] | int) -> Expression: ...
-def scalar(value: bool | float | str) -> Expression: ...
+def scalar(value: Any) -> Expression: ...
Hmm, yeah I guess Any is what you have there so that could work.
But I think it would be more helpful to use something like this to start:
https://github.com/rok/arrow/blob/6a310149ed305d7e2606066f5d0915e9c23310f4/python/pyarrow-stubs/_stubs_typing.pyi#L50
PyScalar: TypeAlias = (bool | int | float | Decimal | str | bytes |
    dt.date | dt.datetime | dt.time | dt.timedelta)

Then the snippet from (#47609 (comment)) seems to imply pa.Scalar is valid as well.
So maybe this would document it more clearly?
def scalar(value: PyScalar | lib.Scalar[Any] | None) -> Expression: ...

def name(self) -> str: ...
@property
def num_kernels(self) -> int: ...
I wonder if the overloads can be generated instead of written out and maintained manually.
Took me a while to discover this without it being in the stubs 😅
@property
def kernels(self) -> list[ScalarKernel | VectorKernel | ScalarAggregateKernel | HashAggregateKernel]:
I know this isn't accurate for Function itself, but it's the type returned by FunctionRegistry.get_function
If you wanted to be a bit fancier, maybe add some Generics into the mix?
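A minimal sketch of what that could look like; the class shapes here are hypothetical stand-ins, not the actual pyarrow API.

```python
# Hypothetical sketch: parameterize Function by its kernel type so that,
# e.g., FunctionRegistry.get_function overloads could narrow .kernels.
from typing import Generic, TypeVar

class ScalarKernel: ...
class VectorKernel: ...

K = TypeVar("K")

class Function(Generic[K]):
    def __init__(self, kernels: list[K]) -> None:
        self._kernels = kernels

    @property
    def kernels(self) -> list[K]:
        # For a Function[VectorKernel], checkers see list[VectorKernel].
        return self._kernels

fn: Function[VectorKernel] = Function([VectorKernel()])
```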
"look at extracting compute kernel signatures from C++ (valid input types are explicitly stated at registration time)"

That would probably be more useful than the route I was going for here.
In Python there's only the repr to work with, but there is quite a lot of information encoded in it:
>>> import pyarrow.compute as pc
>>> pc.get_function("array_take").kernels[:10]
[VectorKernel<(primitive, integer) -> computed>,
 VectorKernel<(binary-like, integer) -> computed>,
 VectorKernel<(large-binary-like, integer) -> computed>,
 VectorKernel<(fixed-size-binary-like, integer) -> computed>,
 VectorKernel<(null, integer) -> computed>,
 VectorKernel<(Type::DICTIONARY, integer) -> computed>,
 VectorKernel<(Type::EXTENSION, integer) -> computed>,
 VectorKernel<(Type::LIST, integer) -> computed>,
 VectorKernel<(Type::LARGE_LIST, integer) -> computed>,
 VectorKernel<(Type::LIST_VIEW, integer) -> computed>]
>>> pc.get_function("min_element_wise").kernels[:10]
[ScalarKernel<varargs[uint8*] -> uint8>,
 ScalarKernel<varargs[uint16*] -> uint16>,
 ScalarKernel<varargs[uint32*] -> uint32>,
 ScalarKernel<varargs[uint64*] -> uint64>,
 ScalarKernel<varargs[int8*] -> int8>,
 ScalarKernel<varargs[int16*] -> int16>,
 ScalarKernel<varargs[int32*] -> int32>,
 ScalarKernel<varargs[int64*] -> int64>,
 ScalarKernel<varargs[float*] -> float>,
 ScalarKernel<varargs[double*] -> double>]
>>> pc.get_function("approximate_median").kernels
[ScalarAggregateKernel<(any) -> double>]
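Since the repr is all that Python exposes, one purely hypothetical starting point would be to parse the signature back out of it. The helper name below is made up, and varargs kernels would need extra handling.

```python
import re

def parse_kernel_repr(text: str):
    """Split a kernel repr like 'VectorKernel<(primitive, integer) -> computed>'
    into (kind, input types, output type).

    Rough sketch: varargs kernels and other repr shapes need extra cases.
    """
    m = re.match(r"(\w+Kernel)<\((.*)\) -> (.+)>", text)
    if m is None:
        return None
    kind, inputs, out = m.groups()
    return kind, [t.strip() for t in inputs.split(",")], out

# Example against one of the reprs shown above:
parsed = parse_kernel_repr("VectorKernel<(primitive, integer) -> computed>")
```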
Oh awesome! Thank you @dangotbanned, I love unsolicited suggestions like these! I am at PyData Paris right now so I probably can't reply properly until Monday, but given your experience I'm sure these will be very useful!

Just a mental note: @pitrou suggested looking at extracting compute kernel signatures from C++ (valid input types are explicitly stated at registration time).
@dangotbanned I got pyright, mypy and ty passing in CI with the following settings (lines 97 to 119 in c9608d2). I'll try to do another pass sometime next week and ask for a review on the discussions thread, but meanwhile feel free to do a pass if you can spare the time :D
raulcd
left a comment
@rok first, thanks for the huge amount of work here!! You rock! (see what I did there?)
I've just taken a look at the build/CI part of it.
As suggested yesterday, I think we should add documentation to the Python development guide, both on how to run type checking and on what is expected from pyarrow developers.
What is the expected workflow when working on it?
.github/workflows/python.yml (outdated):

- name: Type check with mypy and pyright
  run: |-
    python -m pip install mypy pyright ty griffe libcst pytest hypothesis fsspec scipy-stubs pandas-stubs types-python-dateutil types-psutil types-requests griffe libcst sphinx types-cffi
    pip install -i https://pypi.anaconda.org/scientific-python-nightly-wheels/simple pyarrow
This is currently installing pyarrow from the nightlies; as part of CI we should test the pyarrow version from the PR / commit / branch.
On this job pyarrow is built via archery in a Docker container, so it won't be available outside the container. We might want to add a new job just for running type checking, where the Docker container also runs the type checks, or add it to this job's archery docker run. We can probably drive the check with an environment variable and run it directly in ci/scripts/python_test.sh.
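One possible shape for the environment-variable gate suggested above, as a sketch only: the variable name PYARROW_TYPE_CHECKS and the echoed messages are made up, not what was merged.

```shell
# Hypothetical sketch: gate type checking in ci/scripts/python_test.sh
# behind an environment variable so one script serves both jobs.
run_type_checks() {
  if [ "${PYARROW_TYPE_CHECKS:-OFF}" = "ON" ]; then
    echo "running type checks"
    # mypy && pyright && ty check   # real invocations would go here
  else
    echo "type checks skipped"
  fi
}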
Indeed. I've added /arrow/ci/scripts/python_test_type_annotations.sh (and Windows equivalents) to call from compose.yaml:
Lines 935 to 945 in 00047e4
and python.yml:
arrow/.github/workflows/python.yml
Lines 242 to 246 in 00047e4
arrow/.github/workflows/python.yml
Lines 304 to 307 in 00047e4
.github/workflows/python.yml (outdated):

    pyright
    ty check
    cd ..
    python ./dev/update_stub_docstrings.py -f ./python/pyarrow-stubs
Is the update_stub_docstrings.py something:
- devs should run before pushing or via commit hook
- CI should run on every PR
- something we should run when building sdist/wheels
- part of the release process
It's not clear to me.
update_stub_docstrings.py should run just before wheel-build time. See 03bdf10 for how I imagine that would look for wheels. Please note:
- pyarrow must be built before packaging the wheel, so that docstrings of dynamic symbols become available and can be inserted into the stub files
- this also adds a test that checks that the packaged wheel includes one of the stub files (compute.pyi) and that at least one of its symbols is annotated with a docstring

I've also added docs to development.rst, but they may be too sparse. Please check.
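To make the docstring-insertion idea concrete, here is a toy, string-based version. The real update_stub_docstrings.py works on syntax trees (griffe/libcst); the helper name and regex here are made up for illustration and assume a single-line stub signature.

```python
import re

def attach_docstring(stub: str, name: str, doc: str) -> str:
    """Replace 'def name(...) -> T: ...' with a body holding the docstring.

    Toy sketch only; it assumes a single-line signature ending in '...'.
    """
    pattern = rf"(def {re.escape(name)}\(.*?\).*?:) \.\.\."
    replacement = f'\\1\n    """{doc}"""\n    ...'
    return re.sub(pattern, replacement, stub)

stub_line = "def scalar(value: Any) -> Expression: ..."
result = attach_docstring(stub_line, "scalar", "Create a scalar expression.")
```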
Thanks for the review @raulcd!
Do you think we should put the build/CI part into a separate PR?
As mentioned above, I've added some docs but they may be too sparse.
Sorry for the delay in getting back to you @rok! I've been putting in a lot of time working with these stubs. Some fairly high-level things that might be worth checking:
I'll try to dive into some more specific cases soon in a review - so this is just homework for you for now, if you wanted it 😉
I went through the list. ~4 were missing.
You might be right; this remains my homework to check.
Removed.

I've split out the CI part into a separate PR and will proceed to split it down further to enable review.
rebase and some minor work
dan's homework
minor post rebase change
lint
annotation fix
fix some annotations
fix path on macos
add type checking guidelines for developers
package stubs into wheels and test for presence
Add typechecking for Windows
Add typechecking for macos
Moving typechecks under 'Execute Docker Build' step
test for pyarrow
Review feedback
some fixes
remove some newlines
fixes
more fix
fix
cleanup
test
minor fix
fix CI
fix mypy
minor fixes
fix ty checks
WIP
WIP pyright for test_{pandas,scalars,schema,substrait}.py
pyright for test_{sparse_tensor,substrait,tensor,types,udf,without_numpy}.py and util.py
pyright for test_compute.py
pyright for test_dataset.py
pyright test_types.py
pyright work
yet further pyright work
further pyright work
Make pyright stricter
workaround for shadowed types module
bumpy python in pyright
Update python/pyproject.toml
Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
misc
Update python/pyarrow-stubs/pyarrow/compute.pyi
Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
ty
reduce mypy errors
experiment
fix pyright config
fixing missing-imports
Some changes
Update python/pyarrow-stubs/_compute.pyi
Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
Apply suggestions from code review
Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
minor
adding some ignores to pass more checks
Add CI check
Add utility for adding docstrings into annotations
Minor changes to pyarrow so some typechecks pass
Add pyarrow-stubs minus their docstings
Rebased this on main to make it easier to split further into smaller PRs.
NOTE: This PR is currently being split into multiple smaller ones. DO NOT MERGE.

This proposes adding type annotations to pyarrow by adopting pyarrow-stubs into pyarrow. To do so, we copy pyarrow-stubs's stub files into arrow/python/pyarrow-stubs/, restructure them somewhat, and add more annotations. We remove docstrings from the annotations and provide a script to insert docstrings into the stub files at wheel-build time. We also remove overloads from the annotations to simplify this PR. We then add annotation checks for all project files and introduce a CI check to make sure all mypy, pyright and ty annotation checks pass (see python/pyproject.toml for any exceptions).

The PR introduces:
arrow/python/pyarrow-stubs/
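The wheel check mentioned in the discussion above (that the packaged wheel contains a stub file with docstrings) could look roughly like this. The helper name is hypothetical; the actual test lives in the PR's test suite.

```python
# Sketch: a built wheel is a zip archive, so we can look inside it for a
# stub file and check that at least one docstring made it in.
import zipfile

def wheel_has_annotated_stub(wheel_path: str) -> bool:
    with zipfile.ZipFile(wheel_path) as wheel:
        stub = next(
            (n for n in wheel.namelist() if n.endswith("compute.pyi")), None
        )
        if stub is None:
            return False
        # Docstrings were inserted at build time if a triple quote appears.
        return b'"""' in wheel.read(stub)
```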